{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* By: Dominik Dario Dabrowski\n",
    "* Email: dominik@doda.co\n",
    "* Reference: Advances in Financial Machine Learning, Chapter 8, Feature Importance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chapter 8 Feature Imprortance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the most pervasive mistakes in financial research is to take some data, run it through an ML algorithm, backtest the predictions, and repeat the sequence until a nice-looking backtest shows up. Academic journals are filled with such pseudo-discoveries, and even large hedge funds constantly fall into this trap.\n",
    "\n",
    "It typically takes about 20 such iterations to discover a (false) investment strategy subject to the standard significance level (false positive ratio) of 5%. In this chapter we will explore why such an approach is a waste of time and money, and how feature importance offers an alternative."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Marcos' first law of backtesting:**\n",
    "\n",
    "**Backtesting is not a research tool. Feature importance is.**\n",
    "\n",
    "\n",
    "Once we have found what features are important, we can learn more by conducting a number of experiments.\n",
    "\n",
    "- Are these features important all the time, or only in some specific environments?\n",
    "- What triggers a change in importance over time?\n",
    "- Can these regime switches be predicted?\n",
    "- Are those important features also relevant to other related financial instruments?\n",
    "- Are they relevant to other asset classes?\n",
    "- What are the most relevant features across all financial instruments?\n",
    "- What is the subset of features with the  highest rank correlation across the entire investment universe?\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as mpl\n",
    "\n",
    "%matplotlib inline\n",
    "mpl.rcParams['figure.figsize'] = (16, 6)\n",
    "\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import BaggingClassifier\n",
    "\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "from mlfinlab.feature_importance import (\n",
    "    mean_decrease_impurity,\n",
    "    mean_decrease_accuracy,\n",
    "    single_feature_importance,\n",
    "    plot_feature_importance,\n",
    "    get_orthogonal_features,\n",
    ")\n",
    "\n",
    "from mlfinlab.cross_validation import PurgedKFold, ml_cross_val_score\n",
    "from mlfinlab.util.multiprocess import process_jobs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_classification\n",
    "\n",
    "\n",
    "def get_test_data(n_features=40, n_informative=10, n_redundant=10, n_samples=10000):\n",
    "    # generate a random dataset for a classification problem    \n",
    "    trnsX, cont = make_classification(n_samples=n_samples, n_features=n_features, n_informative=n_informative, n_redundant=n_redundant, random_state=0, shuffle=False)\n",
    "    df0 = pd.DatetimeIndex(periods=n_samples, freq=pd.tseries.offsets.Minute(), end=pd.datetime.today())\n",
    "    trnsX = pd.DataFrame(trnsX, index=df0)\n",
    "    cont = pd.Series(cont, index=df0).to_frame('bin')\n",
    "    df0 = ['I_%s' % i for i in range(n_informative)] + ['R_%s' % i for i in range(n_redundant)]\n",
    "    df0 += ['N_%s' % i for i in range(n_features - len(df0))]\n",
    "    trnsX.columns = df0\n",
    "    cont['w'] = 1.0 / cont.shape[0]\n",
    "    cont['t1'] = pd.Series(cont.index, index=cont.index)\n",
    "    return trnsX, cont"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def feature_importances(X, cont, method, allow_masking_effects=False, n_splits=10):\n",
    "    max_features = None if allow_masking_effects else 1\n",
    "    clf = DecisionTreeClassifier(\n",
    "        criterion='entropy', max_features=max_features, class_weight='balanced', min_weight_fraction_leaf=0.0\n",
    "    )\n",
    "    clf = BaggingClassifier(\n",
    "        base_estimator=clf, n_estimators=1000, max_features=1.0, max_samples=1.0, oob_score=True, n_jobs=-1\n",
    "    )\n",
    "    fit = clf.fit(X, cont['bin'])\n",
    "    oob_score = fit.oob_score_\n",
    "\n",
    "    cv_gen = PurgedKFold(n_splits=n_splits, samples_info_sets=cont['t1'])\n",
    "    oos_score = ml_cross_val_score(clf, X, cont['bin'], cv_gen=cv_gen, scoring=accuracy_score).mean()\n",
    "\n",
    "    if method == 'MDI':\n",
    "        imp = mean_decrease_impurity(fit, X.columns)\n",
    "    elif method == 'MDA':\n",
    "        imp = mean_decrease_accuracy(clf, X, cont['bin'], cv_gen, scoring=accuracy_score)\n",
    "    elif method == 'SFI':\n",
    "        imp = single_feature_importance(clf, X, cont['bin'], cv_gen, scoring=accuracy_score)\n",
    "    \n",
    "    return imp, oob_score, oos_score\n",
    "\n",
    "\n",
    "def test_data_func(X, cont, run='', allow_masking_effects=False, methods=['MDI', 'MDA', 'SFI']):\n",
    "    for method in methods:\n",
    "        feature_imp, oob_score, oos_score = feature_importances(X, cont, method, allow_masking_effects)\n",
    "\n",
    "        plot_feature_importance(\n",
    "            feature_imp, oob_score=oob_score, oos_score=oos_score,\n",
    "            save_fig=True, output_path='img/{}_feat_imp{}.png'.format(method, run)\n",
    "        )\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.1a\n",
    "\n",
    "Using the code presented in Section 8.6\n",
    "\n",
    "Generate a dataset $(X, y)$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>I_0</th>\n",
       "      <th>I_1</th>\n",
       "      <th>I_2</th>\n",
       "      <th>I_3</th>\n",
       "      <th>R_0</th>\n",
       "      <th>R_1</th>\n",
       "      <th>R_2</th>\n",
       "      <th>R_3</th>\n",
       "      <th>N_0</th>\n",
       "      <th>N_1</th>\n",
       "      <th>N_2</th>\n",
       "      <th>N_3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2020-02-16 05:29:14.539910</th>\n",
       "      <td>-3.941539</td>\n",
       "      <td>-1.955124</td>\n",
       "      <td>-1.247683</td>\n",
       "      <td>-0.665536</td>\n",
       "      <td>2.870924</td>\n",
       "      <td>0.706670</td>\n",
       "      <td>-0.144982</td>\n",
       "      <td>-1.498281</td>\n",
       "      <td>-0.229430</td>\n",
       "      <td>0.177231</td>\n",
       "      <td>0.648948</td>\n",
       "      <td>-0.818646</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-02-16 05:30:14.539910</th>\n",
       "      <td>-2.882175</td>\n",
       "      <td>-1.822702</td>\n",
       "      <td>-0.568862</td>\n",
       "      <td>0.103451</td>\n",
       "      <td>2.196651</td>\n",
       "      <td>0.966482</td>\n",
       "      <td>-0.527894</td>\n",
       "      <td>-1.100332</td>\n",
       "      <td>0.130209</td>\n",
       "      <td>-0.831310</td>\n",
       "      <td>1.484291</td>\n",
       "      <td>0.320911</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-02-16 05:31:14.539910</th>\n",
       "      <td>-1.897824</td>\n",
       "      <td>-0.659752</td>\n",
       "      <td>-0.575968</td>\n",
       "      <td>1.432049</td>\n",
       "      <td>1.647345</td>\n",
       "      <td>0.800773</td>\n",
       "      <td>-0.995133</td>\n",
       "      <td>-1.899108</td>\n",
       "      <td>-1.667659</td>\n",
       "      <td>-0.005389</td>\n",
       "      <td>2.347850</td>\n",
       "      <td>0.202494</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-02-16 05:32:14.539910</th>\n",
       "      <td>-2.574587</td>\n",
       "      <td>1.990887</td>\n",
       "      <td>0.383741</td>\n",
       "      <td>3.980372</td>\n",
       "      <td>3.637930</td>\n",
       "      <td>0.773705</td>\n",
       "      <td>-1.803899</td>\n",
       "      <td>-6.490407</td>\n",
       "      <td>0.105738</td>\n",
       "      <td>1.093880</td>\n",
       "      <td>-0.037027</td>\n",
       "      <td>-1.414238</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-02-16 05:33:14.539910</th>\n",
       "      <td>-1.885823</td>\n",
       "      <td>-2.601728</td>\n",
       "      <td>-1.325420</td>\n",
       "      <td>-0.736274</td>\n",
       "      <td>0.621126</td>\n",
       "      <td>0.755418</td>\n",
       "      <td>-0.237697</td>\n",
       "      <td>1.156626</td>\n",
       "      <td>-1.178807</td>\n",
       "      <td>0.069023</td>\n",
       "      <td>0.454516</td>\n",
       "      <td>-0.522534</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                 I_0       I_1       I_2       I_3       R_0  \\\n",
       "2020-02-16 05:29:14.539910 -3.941539 -1.955124 -1.247683 -0.665536  2.870924   \n",
       "2020-02-16 05:30:14.539910 -2.882175 -1.822702 -0.568862  0.103451  2.196651   \n",
       "2020-02-16 05:31:14.539910 -1.897824 -0.659752 -0.575968  1.432049  1.647345   \n",
       "2020-02-16 05:32:14.539910 -2.574587  1.990887  0.383741  3.980372  3.637930   \n",
       "2020-02-16 05:33:14.539910 -1.885823 -2.601728 -1.325420 -0.736274  0.621126   \n",
       "\n",
       "                                 R_1       R_2       R_3       N_0       N_1  \\\n",
       "2020-02-16 05:29:14.539910  0.706670 -0.144982 -1.498281 -0.229430  0.177231   \n",
       "2020-02-16 05:30:14.539910  0.966482 -0.527894 -1.100332  0.130209 -0.831310   \n",
       "2020-02-16 05:31:14.539910  0.800773 -0.995133 -1.899108 -1.667659 -0.005389   \n",
       "2020-02-16 05:32:14.539910  0.773705 -1.803899 -6.490407  0.105738  1.093880   \n",
       "2020-02-16 05:33:14.539910  0.755418 -0.237697  1.156626 -1.178807  0.069023   \n",
       "\n",
       "                                 N_2       N_3  \n",
       "2020-02-16 05:29:14.539910  0.648948 -0.818646  \n",
       "2020-02-16 05:30:14.539910  1.484291  0.320911  \n",
       "2020-02-16 05:31:14.539910  2.347850  0.202494  \n",
       "2020-02-16 05:32:14.539910 -0.037027 -1.414238  \n",
       "2020-02-16 05:33:14.539910  0.454516 -0.522534  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X, cont = get_test_data(n_features=12, n_informative=4, n_redundant=4, n_samples=5000)\n",
    "X.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.1b\n",
    "\n",
    "Using the code presented in Section 8.6\n",
    "\n",
    "Apply a PCA transformation on X, which we denote $\\dot X$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PCA_0</th>\n",
       "      <th>PCA_1</th>\n",
       "      <th>PCA_2</th>\n",
       "      <th>PCA_3</th>\n",
       "      <th>PCA_4</th>\n",
       "      <th>PCA_5</th>\n",
       "      <th>PCA_6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2020-02-16 05:29:14.539910</th>\n",
       "      <td>2.065379</td>\n",
       "      <td>0.146105</td>\n",
       "      <td>-1.182983</td>\n",
       "      <td>0.134590</td>\n",
       "      <td>0.530868</td>\n",
       "      <td>-0.363318</td>\n",
       "      <td>0.852387</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-02-16 05:30:14.539910</th>\n",
       "      <td>1.839769</td>\n",
       "      <td>0.822382</td>\n",
       "      <td>-0.949730</td>\n",
       "      <td>1.217152</td>\n",
       "      <td>-0.766848</td>\n",
       "      <td>-0.026661</td>\n",
       "      <td>1.015070</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-02-16 05:31:14.539910</th>\n",
       "      <td>1.955510</td>\n",
       "      <td>1.095325</td>\n",
       "      <td>0.280685</td>\n",
       "      <td>2.428007</td>\n",
       "      <td>0.794605</td>\n",
       "      <td>-0.732763</td>\n",
       "      <td>1.020887</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-02-16 05:32:14.539910</th>\n",
       "      <td>4.569843</td>\n",
       "      <td>1.217288</td>\n",
       "      <td>1.651004</td>\n",
       "      <td>-1.110274</td>\n",
       "      <td>1.290850</td>\n",
       "      <td>-0.047355</td>\n",
       "      <td>0.666502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-02-16 05:33:14.539910</th>\n",
       "      <td>0.234743</td>\n",
       "      <td>0.652217</td>\n",
       "      <td>-1.114901</td>\n",
       "      <td>0.621042</td>\n",
       "      <td>0.784154</td>\n",
       "      <td>-0.946805</td>\n",
       "      <td>0.232960</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                               PCA_0     PCA_1     PCA_2     PCA_3     PCA_4  \\\n",
       "2020-02-16 05:29:14.539910  2.065379  0.146105 -1.182983  0.134590  0.530868   \n",
       "2020-02-16 05:30:14.539910  1.839769  0.822382 -0.949730  1.217152 -0.766848   \n",
       "2020-02-16 05:31:14.539910  1.955510  1.095325  0.280685  2.428007  0.794605   \n",
       "2020-02-16 05:32:14.539910  4.569843  1.217288  1.651004 -1.110274  1.290850   \n",
       "2020-02-16 05:33:14.539910  0.234743  0.652217 -1.114901  0.621042  0.784154   \n",
       "\n",
       "                               PCA_5     PCA_6  \n",
       "2020-02-16 05:29:14.539910 -0.363318  0.852387  \n",
       "2020-02-16 05:30:14.539910 -0.026661  1.015070  \n",
       "2020-02-16 05:31:14.539910 -0.732763  1.020887  \n",
       "2020-02-16 05:32:14.539910 -0.047355  0.666502  \n",
       "2020-02-16 05:33:14.539910 -0.946805  0.232960  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Xdot = pd.DataFrame(get_orthogonal_features(X), index=X.index).add_prefix(\"PCA_\")\n",
    "Xdot.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.1c\n",
    "\n",
    "Using the code presented in Section 8.6\n",
    "\n",
    "Compute MDI, MDA, and SFI feature importance on $(\\dot X, y)$, where the base estimator is a RF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_func(Xdot, cont, '_8.1c')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/MDI_feat_imp_8.1c2.png)\n",
    "![title](img/MDA_feat_imp_8.1c2.png)\n",
    "![title](img/SFI_feat_imp_8.1c2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.1d\n",
    "\n",
    "Using the code presented in Section 8.6\n",
    "\n",
    "Do the three methods agree on what features are important? Why?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A: PCA successfully helped us reduce our data from 12 features to 7 and across those 7 features, our 3 feature importance methods agreed that the first few principal components (PCA_{0, 1,2}) are the most important.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.2a\n",
    "\n",
    "From exercise 1, generate a new dataset $(\\ddot X, y)$, where $\\ddot X$ is a feature union of $X$ and $\\dot X$.\n",
    "\n",
    "Compute MDI, MDA, and SFI feature importance on $(\\ddot X, y)$, where the base estimator is a RF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "Xdotdot = pd.concat([X, Xdot], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_func(Xdotdot, cont, '_8.2a')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/MDI_feat_imp_8.2a.png)\n",
    "![title](img/MDA_feat_imp_8.2a.png)\n",
    "![title](img/SFI_feat_imp_8.2a.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.2b\n",
    "\n",
    "From exercise 1, generate a new dataset $(\\ddot X, y)$, where $\\ddot X$ is a feature union of $X$ and $\\dot X$.\n",
    "\n",
    "Do the three methods agree on what features are important? Why?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A: MDI & SFI rank untransformed informative & redundant features above noisy ones and the first principal components over latter ones. MDA in this case does not seem to be able to rank the features correctly.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.3a\n",
    "\n",
    "Take the results from exercise 2: \n",
    "\n",
    "Drop the most important features according to each method, resulting in a features matrix $\\dddot X$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "most_important_features = ['I_2', 'PCA_1', 'PCA_0', 'R_2', 'I_1']\n",
    "Xdotdotdot = Xdotdot.loc[:, ~Xdotdot.columns.isin(most_important_features)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.3b\n",
    "\n",
    "Take the results from exercise 2: \n",
    "\n",
    "Compute MDI, MDA, and SFI feature importance on $(\\dddot X, y)$, where the base estimator is a RF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_func(Xdotdotdot, cont, '_8.3b')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.3c\n",
    "\n",
    "Take the results from exercise 2: \n",
    "\n",
    "Do you appreciate significant changes in the rankings of important features, relative to the results from exercise 2?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/MDI_feat_imp_8.3b.png)\n",
    "![title](img/MDA_feat_imp_8.3b.png)\n",
    "![title](img/SFI_feat_imp_8.3b.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A: MDI & SFI seem unperturbed, while MDA has shifted a lot and now assigns positive feature importance to all remaining first principal components and informative and redundant features.**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.4a\n",
    "\n",
    "Using the code presented in Section 8.6:\n",
    "\n",
    "Generate a dataset $(X, y)$ of 1E6 observations, where 5 features are informative, 5 are redundant and 10 are noise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 10000\n",
    "X, cont = get_test_data(n_features=20, n_informative=5, n_redundant=5, n_samples=n_samples)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.4b\n",
    "\n",
    "Using the code presented in Section 8.6:\n",
    "\n",
    "Split $(X, y)$ into 10 datasets, each of 1E5 observations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A: Implemented in the next answer.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.4c\n",
    "\n",
    "Using the code presented in Section 8.6:\n",
    "\n",
    "Compute the parallelized feature importance on each of the 10 datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "def combine_imps(imps):\n",
    "    return pd.DataFrame({\n",
    "        'mean': pd.concat([x['mean'] for x in imps], axis=1).mean(axis=1),\n",
    "        'std': pd.concat([x['std'] for x in imps], axis=1).mean(axis=1),\n",
    "    })\n",
    "\n",
    "def chunked_test_data_func(X, cont, n_chunks=1, run='', allow_masking_effects=False, methods=['MDI', 'MDA', 'SFI'], num_threads=32):\n",
    "    # In order to avoid some multi-processing bugs our parallelized function has been factored out to be importable\n",
    "    from feature_importances_mp import feature_importances\n",
    "    chunks = np.array_split(X.index, n_chunks)\n",
    "\n",
    "    for method in methods:\n",
    "        jobs = [{\n",
    "            'func': feature_importances,\n",
    "            'X': X.loc[chunk],\n",
    "            'cont': cont.loc[chunk],\n",
    "            'method': method,\n",
    "            'allow_masking_effects': allow_masking_effects,\n",
    "        } for chunk in chunks]\n",
    "\n",
    "        results = process_jobs(jobs, num_threads=num_threads)\n",
    "    \n",
    "        imps, oobs, ooss = zip(*results)\n",
    "\n",
    "        feature_imp = combine_imps(imps)\n",
    "        oob_score, oos_score = pd.Series(oobs).mean(), pd.Series(ooss).mean()\n",
    "\n",
    "        plot_feature_importance(\n",
    "            feature_imp, oob_score=oob_score, oos_score=oos_score,\n",
    "            save_fig=True, output_path='img/{}_feat_imp{}.png'.format(method, run)\n",
    "        )\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chunked_test_data_func(X, cont, n_chunks=10, run='_8.4c_10chunks')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Parallelized feature importance:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/MDI_feat_imp_8.4c_10chunks.png)\n",
    "![title](img/MDA_feat_imp_8.4c_10chunks.png)\n",
    "![title](img/SFI_feat_imp_8.4c_10chunks.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.4d\n",
    "\n",
    "Using the code presented in Section 8.6:\n",
    "\n",
    "Compute the stacked feature importance on the combined dataset $(X, y)$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_func(X, cont, run='_8.4c_1chunk')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Stacked feature importance:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/MDI_feat_imp_8.4c_1chunk.png)\n",
    "![title](img/MDA_feat_imp_8.4c_1chunk.png)\n",
    "![title](img/SFI_feat_imp_8.4c_1chunk.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.4e \n",
    "\n",
    "Using the code presented in Section 8.6:\n",
    "\n",
    "What causes the discrepancy between the two? Which one is more reliable?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A: Both methods generate similar rankings, with informative and redundant features above noisy ones, while the more computationally intensive (stacked) does so by a much wider margin.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.5\n",
    "\n",
    "Repeat all MDI calculations from exercises 1-4, but this time allow for masking effects. That means, do not set `max_features=int(1)` in Snippet 8.2. How do results differ as a consequence of this change? Why?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "X, cont = get_test_data(n_features=12, n_informative=4, n_redundant=4, n_samples=5000)\n",
    "Xdot = pd.DataFrame(get_orthogonal_features(X), index=X.index).add_prefix(\"PCA_\")\n",
    "test_data_func(Xdot, cont, '_8.5_1', allow_masking_effects=True, methods=['MDI'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/MDI_feat_imp_8.5_1_2.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Xdotdot = pd.concat([X, Xdot], axis=1)\n",
    "test_data_func(Xdotdot, cont, '_8.5_2', allow_masking_effects=True, methods=['MDI'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/MDI_feat_imp_8.5_2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A: There is little change for the PCA-transformed features, while MDI seems to perform a lot better on the union of transformed and non-transformed features when allowing for masking effects.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "most_important_features = ['I_2', 'PCA_1', 'I_1', 'PCA_0', 'R_3']\n",
    "\n",
    "Xdotdotdot = Xdotdot.loc[:, ~Xdotdot.columns.isin(most_important_features)]\n",
    "test_data_func(Xdotdotdot, cont, '_8.5_3', allow_masking_effects=True, methods=['MDI'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/MDI_feat_imp_8.5_3.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 10000\n",
    "X, cont = get_test_data(n_features=20, n_informative=5, n_redundant=5, n_samples=n_samples)\n",
    "\n",
    "chunked_test_data_func(X, cont, n_chunks=10, run='_8.5_4', allow_masking_effects=True, methods=['MDI'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/MDI_feat_imp_8.5_4.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_func(X, cont, '_8.5_5', allow_masking_effects=True, methods=['MDI'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](img/MDI_feat_imp_8.5_5.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A: Allowing for masking effects still manages to rank PCA features correctly, however when untransformed redundant and noisy features are introduced, the feature importance methods quickly produce much worse results than when run when not allowing for masking effects. While stacked feature importance still does OK, parallelized feature importance also ranks some noisy above informative, and some redundant below most other features.**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
